Coresets for k-Segmentation of Streaming Data
Abstract
Life-logging video streams, financial time series, and Twitter tweets are a few examples of high-dimensional signals over practically unbounded time. We consider the problem of computing an optimal segmentation of such signals by a k-piecewise linear function, using only one pass over the data, by maintaining a coreset for the signal. The coreset enables fast further analysis of such signals, such as automatic summarization. A coreset (core-set) is a compact representation of the data seen so far that approximates the data well for a specific task; in our case, segmentation of the stream. We show that, perhaps surprisingly, the segmentation problem admits coresets of cardinality only linear in the number of segments k, independently of both the dimension d of the signal and its number n of points. More precisely, we construct a representation of size O(k log n/ε) that provides a (1+ε)-approximation for the sum of squared distances to any given k-piecewise linear function. Moreover, such coresets can be constructed in a parallel streaming approach. Our results rely on a novel reduction of statistical estimations to problems in computational geometry. We empirically evaluate our algorithms on very large synthetic and real data sets from GPS, video and financial domains, using 255 machines in the Amazon cloud.
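To make the idea of a compact, mergeable summary concrete, here is a minimal sketch for the simplest case of a single segment. It is an illustration only and not the coreset construction from the paper: it assumes the cost is the sum of squared differences between each point y_i and the value a*t_i + b of a linear function at time t_i, and the names `segment_summary` and `cost_from_summary` are hypothetical. Under that assumption, a (d+2)×(d+2) moment matrix gives the exact cost of any linear function, and the summaries of two chunks of a stream merge by addition.

```python
import numpy as np

def segment_summary(t, Y):
    """Moment matrix X^T X of the rows [t_i, 1, y_i]; size (d+2) x (d+2), independent of n."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    Y = np.asarray(Y, dtype=float).reshape(len(t), -1)
    X = np.hstack([t, np.ones_like(t), Y])
    return X.T @ X

def cost_from_summary(M, a, b):
    """Sum over i of ||a*t_i + b - y_i||^2, computed from the summary alone."""
    a = np.atleast_1d(np.asarray(a, dtype=float))
    b = np.atleast_1d(np.asarray(b, dtype=float))
    d = a.size
    # X @ V has rows a*t_i + b - y_i, so the cost is trace(V^T M V).
    V = np.vstack([a, b, -np.eye(d)])
    return float(np.trace(V.T @ M @ V))

# Usage: summarize two chunks of a stream separately, then merge by adding.
rng = np.random.default_rng(0)
t = np.arange(50.0)
Y = rng.normal(size=(50, 3))
M = segment_summary(t[:20], Y[:20]) + segment_summary(t[20:], Y[20:])
a, b = np.array([0.1, -0.2, 0.3]), np.array([1.0, 0.0, -1.0])
print(cost_from_summary(M, a, b))
```

This exact trick covers only one segment with known boundaries. With k segments whose breakpoints are unknown in advance, a single moment matrix no longer suffices, which is why the approximate coreset described in the abstract is needed.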
Similar resources
Coresets for k-Segmentation of Streaming Data Supplementary Material
In this supplementary material we detail the construction, properties, and proofs for a k-segment mean coreset that allows efficient segmentation of high-dimensional signals. We define the k-segment mean problem in Section 2. We describe a coreset for the 1-segment mean in Section 3. We show why a similar construction is not possible for the k-segment mean problem in Section 4. In Sections 5, 6, 7...
Core-Preserving Algorithms
We define a class of algorithms for constructing coresets of (geometric) data sets, and show that algorithms in this class can be dynamized efficiently in the insertion-only (data stream) model. As a result, we show that for a set of points in fixed dimensions, additive and multiplicative ε-coresets for the k-center problem can be maintained in O(1) and O(k) time respectively, using a data struc...
Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering
We prove that the sum of the squared Euclidean distances from the n rows of an n×d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1+ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means cl...
On k-Median clustering in high dimensions
We study approximation algorithms for k-median clustering. We obtain small coresets for k-median clustering in metric spaces as well as in Euclidean spaces. Specifically, in R^d, those coresets are of size with only polynomial dependency on d. This leads to a (1 + ε)-approximation algorithm for k-median clustering in R^d, with running time O(ndk + 2 O(1) dn), for any σ > 0. This is an improvement ...
Streaming Algorithms for k-Means Clustering with Fast Queries
We present methods for k-means clustering on a stream with a focus on providing fast responses to clustering queries. When compared with the current state-of-the-art, our methods provide a substantial improvement in the time to answer a query for cluster centers, while retaining the desirable properties of provably small approximation error, and low space usage. Our algorithms are based on a no...
Publication date: 2014